Data Science Tools: Numpy and Pandas

Both Numpy and Pandas are common tools used by Data Scientist.

Numpy
1. Multidimensional arrays and matrices
2. Mathematical functions
Pandas
1. Structure and manipulate data in ways well suited for data analysis

Numpy

Numpy is a library that is especially useful for when you want to work with large arrays and matrices of numeric data. Numpy is optimized to run fast, much faster than working with Python lists directly.

At the foundation of Numpy is the array object class. Numpy arrays are like python lists except that every element in the array has to be of the same type.



In [2]:

    
import numpy as np
array = np.array([1,2,3,4,5],float)



In [3]:

    
import numpy as np
array = np.array([1,2,3,4,5],float)
print array
matrix = np.array([[1,2,3,4],[5,6,7,8]],float)
print matrix









    



[ 1.  2.  3.  4.  5.]
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]

You can index, slice, and manipulate a numpy array much like you do Python lists



In [4]:

    
print array[1]
print array[:2]
print array[3:]









    



2.0
[ 1.  2.]
[ 4.  5.]



In [5]:

    
print matrix[1][1]
print matrix[1,:]
print matrix[:,1]









    



6.0
[ 5.  6.  7.  8.]
[ 2.  6.]

You can perform arithmetic operations on Numpy arrays



In [6]:

    
array1 = np.array([1,2],float)
array2 = np.array([3,4],float)

print array1 + array2
print array1 - array2
print array1 * array2
print array1 / array2









    



[ 4.  6.]
[-2. -2.]
[ 3.  8.]
[ 0.33333333  0.5       ]



In [7]:

    
matrix1 = np.array([[1,2],[3,4]],float)
matrix2 = np.array([[5,6],[7,8]],float)

print matrix1 + matrix2
print matrix1 - matrix2
print matrix1 * matrix2
print matrix1 / matrix2









    



[[  6.   8.]
 [ 10.  12.]]
[[-4. -4.]
 [-4. -4.]]
[[  5.  12.]
 [ 21.  32.]]
[[ 0.2         0.33333333]
 [ 0.42857143  0.5       ]]

In addition to arithmetic operations, Numpy has a number of other mathematical operations you can apply (e.g., mean, median, standard deviation, dot product)



In [8]:

    
# Functions useful for statistical analysis
array1 = np.arange(1,6)

print array1
print np.mean(array1)
print np.median(array1)
print np.std(array1)









    



[1 2 3 4 5]
3.0
3.0
1.41421356237

Refresher Dot product formula: a = [a1,a2,a3] b = [b1,b2,b3] dot(a,b) = a1xb1 + a2xb2 + a3xb3 For more info on Numpy.dot go to: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html



In [9]:

    
# Dot product
array2 = np.arange(3,8)
print array1
print array2

print(np.dot(array1,array2))









    



[1 2 3 4 5]
[3 4 5 6 7]
85



In [10]:

    
# Dot product (matrix)
oneDArray= np.array([1,2])
twoDArray = np.array([[2,4,6],[3,5,7]])

print np.dot(oneDArray,twoDArray)

Pandas

Data can be stored and referenced using Pandas.

Data in Pandas is often stored in a structure called the dataframe. A dataframe is a 2D labled data structure with columns that can be different types.

Dataframe



In [11]:

    
from pandas import DataFrame, Series



In [12]:

    
# Create a dictionary where the key = name of columns and the value is a corresponding series
# Value -> 1st pass the data you want in dataframe and index where you want the data to go
#
# You can think about a Series as a one-dimensional object that is similar to an array, 
# list, or column in a database.  By default it will assign an index label to each to each 
# item in the Series ranging from zero to N-1, where N is the number of items in the series.
#
data = {'name':Series(['Braund','Cummings','Heikkinen','Allen'],index = ['a','b','c','d']),
        'age':Series([22,38,26,35],index = ['a','b','c','d']),
        'Fare':Series([7.25,71.83,8.05],index = ['a','b','d']),
        'Survived':Series([False,True,True,False],index = ['a','b','c','d'])}



In [13]:

    
# Pass data as a argument to the data frame function to create the actual data frame
df = DataFrame(data)
print df









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]

Accesing Data Column Data



In [14]:

    
# You can operate on specific columns by calling them as if they were a key in a dictionary.
# You can access one column.  When you do, you get a Series
print df['name']









    



a       Braund
b     Cummings
c    Heikkinen
d        Allen
Name: name, dtype: object



In [15]:

    
# You can access multiple columns by passing list of column names.  
# When you do, you get back a dataframe.
df[['name','age','Survived']]









    Out[15]:






  
    
      
      name
      age
      Survived
    
  
  
    
      a
          Braund
       22
       False
    
    
      b
        Cummings
       38
        True
    
    
      c
       Heikkinen
       26
        True
    
    
      d
           Allen
       35
       False
    
  

4 rows × 3 columns

Subseting dataframe using true/false statements on columns



In [16]:

    
print df

# Select rows in which the passanger age >= 30
print df[df['age']>= 30]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings
d   8.05    False   35     Allen

[2 rows x 4 columns]



In [17]:

    
# You can also perform the above operation on particular columns
# Example: Get 'Survived' information for passagengers whose age >= 30
# Notes: 
# df['Survived'] only picks out data from the 'Survived' column 
# df['Survived'][df['age']>= 30] picks indices where this statment 'df[age] > 30' is true.
print df
print df['Survived'][df['age'] >= 30]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
b     True
d    False
Name: Survived, dtype: bool

Pandas Row Indexing

You can access rows through multiple ways

An individual index (through the functions loc and iloc)
Slicing
Boolean indexing

You could also combine multiple selection requirements through boolean operators like & (and) and | (or)

Index (loc,iloc)



In [18]:

    
print df

# Get row corresponding to passenger "Braund", whose index is a
print df.loc['a']
print ""

# You can also access via integer position
print df.iloc[[1]]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
Fare          7.25
Survived     False
age             22
name        Braund
Name: a, dtype: object

    Fare Survived  age      name
b  71.83     True   38  Cummings

[1 rows x 4 columns]

Boolean Indexing



In [19]:

    
print df

# Example: find the passengers with age >= 30
print df[df['age']>= 30]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings
d   8.05    False   35     Allen

[2 rows x 4 columns]

Slicing



In [20]:

    
print df

print df[0:2]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
a   7.25    False   22    Braund
b  71.83     True   38  Cummings

[2 rows x 4 columns]

Boolean Operations



In [21]:

    
print df

# Multiple selection using & (and)
print df[(df.Survived == True) & (df.age > 30)]









    



    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings

[1 rows x 4 columns]

It is possible to perform boolean indexing on specific columns

Pandas Descriptive Functions

Pandas also has various functions that help you understand some basic information about your dataframe. Some of this functions are:

dtypes: to get the data type for each column
describe: useful for seeing basic statistics of numeric columns in your dataframe
head: displays the first 5 rows in your dataset
tail: displays the last 5 rows in your dataset



In [22]:

    
print df.dtypes
print ""
print df.describe()









    



Fare        float64
Survived       bool
age           int64
name         object
dtype: object

            Fare   Survived    age
count   3.000000          4   4.00
mean   29.043333        0.5  30.25
std    37.056499  0.5773503   7.50
min     7.250000      False  22.00
25%     7.650000          0  25.00
50%     8.050000        0.5  30.50
75%    39.940000          1  35.75
max    71.830000       True  38.00

[8 rows x 3 columns]

Pandas Vectorization

What does it mean to operate on a dataframe in a vecotired way?



In [23]:

    
# First, lets create a dataframe
d = {'one': Series([1,2,3], index = ['a','b','c']),
     'two': Series([1,2,3,4], index = ['a','b','c','d'])}

df = DataFrame(d)

print df









    



   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

[4 rows x 2 columns]



In [24]:

    
# Second, we apply an arbitrary function to all the columns
# in the dataframe using pf.apply.
# The result itself is a new dataframe

# Example: apply numpy.mean to every column (axis = 0 for column operation, axis = 1 for row operation)

df.apply(np.mean, axis = 0)









    Out[24]:





one    2.0
two    2.5
dtype: float64

map



In [25]:

    
# You can apply map to a column in the dataframe or the entire dataframe.
# This functions will allow you to apply functions that take in a single value and returns
# a single value

# Example
print df['one']

#Go through every single value in 'One' and applies lambda function
print df['one'].map(lambda x: x> 1) 

# Refresher:
# Lambda functions are small inline functions that are 
# defined on-the-fly in Python. lambda x: x>= 1 will take an input x and return x>=1, 
# or a boolean that equals True or False.
# For more info go to: https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions









    



a     1
b     2
c     3
d   NaN
Name: one, dtype: float64
a    False
b     True
c     True
d    False
Name: one, dtype: bool



In [26]:

    
# The function can be applied to all the columns of the dataframe
print df
print df.applymap(lambda x: x>1)









    



   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

[4 rows x 2 columns]
     one    two
a  False  False
b   True   True
c   True   True
d  False   True

[4 rows x 2 columns]

Note: map() can only be used on a Series to return a new Series and applymap() can only be used on a DataFrame to return a new DataFrame.

Exercise #1



In [27]:

    
'''
Compute the average of bronze medals earned by countries who earned at least one gold medal.
Save this to a variable called: avg_bronze_at_least_one_gold

HINT-1:
    You can retrieve all of the values of a Pandas column from a 
    data frame, "df", as follows:
    df['column_name']
    
    HINT-2:
    The numpy.mean function can accept as an argument a single
    Pandas column. 
    
    For example, numpy.mean(df["col_name"]) would return the 
    mean of the values located in "col_name" of a dataframe df.
'''
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
    
olympic_medal_counts = {'country_name':Series(countries),
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}

df = DataFrame(olympic_medal_counts)

np.mean(df['bronze'][df['gold']>0])
np.mean(df[df['gold']>0]['bronze'])









    Out[27]:





4.2380952380952381

Exercise #2

Using the dataframe's apply method, create a new Series called avg_medal_count that indicates the average number of gold, silver, and bronze medals earned amongst countries who earned at least one medal of any kind at the 2014 Sochi olympics.



In [28]:

    
# Original data frame
print df









    



    bronze    country_name  gold  silver
0        9    Russian Fed.    13      11
1       10          Norway    11       5
2        5          Canada    10      10
3       12   United States     9       7
4        9     Netherlands     8       7
5        5         Germany     8       6
6        2     Switzerland     6       3
7        1         Belarus     5       0
8        5         Austria     4       8
9        7          France     4       4
10       1          Poland     4       1
11       2           China     3       4
12       2           Korea     3       3
13       6          Sweden     2       7
14       2  Czech Republic     2       4
15       4        Slovenia     2       2
16       3           Japan     1       4
17       1         Finland     1       3
18       2   Great Britain     1       1
19       1         Ukraine     1       0
20       0        Slovakia     1       0
21       6           Italy     0       2
22       2          Latvia     0       2
23       1       Australia     0       2
24       0         Croatia     0       1
25       1      Kazakhstan     0       0

[26 rows x 4 columns]



In [29]:

    
# First:  Get subset of dataframe that only contains medal information
df1 = df[['gold','silver','bronze']]
print df1









    



    gold  silver  bronze
0     13      11       9
1     11       5      10
2     10      10       5
3      9       7      12
4      8       7       9
5      8       6       5
6      6       3       2
7      5       0       1
8      4       8       5
9      4       4       7
10     4       1       1
11     3       4       2
12     3       3       2
13     2       7       6
14     2       4       2
15     2       2       4
16     1       4       3
17     1       3       1
18     1       1       2
19     1       0       1
20     1       0       0
21     0       2       6
22     0       2       2
23     0       2       1
24     0       1       0
25     0       0       1

[26 rows x 3 columns]



In [30]:

    
# Second:  Keep rows in which there is at least 1 medal 
df2 = df1[(df1['gold']>0) | (df1['silver']>0) | (df1['bronze']>0)]  
print df2









    



    gold  silver  bronze
0     13      11       9
1     11       5      10
2     10      10       5
3      9       7      12
4      8       7       9
5      8       6       5
6      6       3       2
7      5       0       1
8      4       8       5
9      4       4       7
10     4       1       1
11     3       4       2
12     3       3       2
13     2       7       6
14     2       4       2
15     2       2       4
16     1       4       3
17     1       3       1
18     1       1       2
19     1       0       1
20     1       0       0
21     0       2       6
22     0       2       2
23     0       2       1
24     0       1       0
25     0       0       1

[26 rows x 3 columns]



In [31]:

    
# Third: Get average number of gold, silver, and bronze  
avg_medal_count = Series(df2.apply(np.mean))

print avg_medal_count









    



gold      3.807692
silver    3.730769
bronze    3.807692
dtype: float64

Exercise #3

Imagine a point system in which each country is awarded 4 points for each gold medal, 2 points for each silver medal, and 1 point for each bronze medal.

Using the numpy.dot function, create a new dataframe that includes: a) a column called 'country_name' with the country name b) a column called 'points' with the total number of points the country earned at the Sochi olympics



In [33]:

    
# Create an array with point system
awards = np.array([4,2,1])

data = {'country_name':df['country_name'],
        'points':Series(df[['gold','silver','bronze']].apply(lambda x:np.dot(awards,x),axis = 1))}

newDF = DataFrame(data)
print newDF









    



      country_name  points
0     Russian Fed.      83
1           Norway      64
2           Canada      65
3    United States      62
4      Netherlands      55
5          Germany      49
6      Switzerland      32
7          Belarus      21
8          Austria      37
9           France      31
10          Poland      19
11           China      22
12           Korea      20
13          Sweden      28
14  Czech Republic      18
15        Slovenia      16
16           Japan      15
17         Finland      11
18   Great Britain       8
19         Ukraine       5
20        Slovakia       4
21           Italy      10
22          Latvia       6
23       Australia       5
24         Croatia       2
25      Kazakhstan       1

[26 rows x 2 columns]



In [37]:

    
dp_function = lambda x:np.dot(awards,x);

data = {'country_name':df['country_name'],
        'points':Series(df[['gold','silver','bronze']].apply(dp_function,axis = 1))}

newDF = DataFrame(data)
print newDF









    



      country_name  points
0     Russian Fed.      83
1           Norway      64
2           Canada      65
3    United States      62
4      Netherlands      55
5          Germany      49
6      Switzerland      32
7          Belarus      21
8          Austria      37
9           France      31
10          Poland      19
11           China      22
12           Korea      20
13          Sweden      28
14  Czech Republic      18
15        Slovenia      16
16           Japan      15
17         Finland      11
18   Great Britain       8
19         Ukraine       5
20        Slovakia       4
21           Italy      10
22          Latvia       6
23       Australia       5
24         Croatia       2
25      Kazakhstan       1

[26 rows x 2 columns]



In [32]:

References

Numpy Reference Guide
1. http://docs.scipy.org/doc/numpy/reference/
Pandas Documentation
1. http://pandas.pydata.org/pandas-docs/stable/
Pandas tutorial using Ipython notebooks
1. https://bitbucket.org/hrojas/learn-pandas